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5 MULTIVARIATE RANDOM SEARCH METHOD WITH 
MULTIPLE STARTS AND EARLY STOP FOR 
IDENTIFICATION OF DIFFERENTIALLY EXPRESSED 
GENES BASED ON MICROARRAY DATA 

10 BACKGROUND OF THE INVENTION 
FIELD OF THE INVENTION 

The present invention relates in general to statistical analysis of 
microarray data generated from nucleotide arrays. Specifically, the present 

15 invention relates to identification of differentially expressed genes by 

multivariate microarray data analysis. More specifically, the present invention 
provides an improved multivariate random search method for identifying large 
sets of genes that are differentially expressed under a given biological state or 
at a given biological locale of interest The method of the invention 

20 implements multiple starts and early stop in the random search of sets of 
differentially expressed genes. 

DESCRIPTION OF THE RELATED ART 

Gene expression analyses based on microarray data promises to open 
25 new avmues for researchers to unravel the functions and interactions of genes 
in various biological pathways and, ultimately, to uncover the mechanisms of 
life in diversified species. A significant objective tn such expression analyses 
is to identify genes that are differentially expressed.in different cells, tissues, 
organs of interest or at different biological states. So identified, a set of 
30 differentially expressed genes associated with a certain biological state, e.g., 
tumor or certain pathology, may point to the cause of such tumor or pathology, 
and thereby shed light on the search of potential cures. 
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In practice, however, gene expression studies are hampered by many 
difficulties. For example, poor reproducibility in microarray readings can 
obscure actual differences between normal and pathological cells or create false 
positives and false negatives. The tension between the extremely large number 
of genes present (hence high dimensionality of the feature space) and the 
relatively small number of measurements also poses serious challenges to 
researchers in making accurate diagnostic inferences. 

Existing methods for selecting differentially expressed genes are 
typically univariate, not taking into account the information on interactions 
among genes. As appreciated by an ordinary skilled molecular biologist, genes 
do not operate in isolation - activation of one gene may trigger changes in the 
expression levels of other genes. That is, genes may be involved in one or 
more pathways. Therefore, determination of differentially expressed genes 
calls for consideration of covariance structure of the microarray data, in 
addition to, for example, mean expression levels. In this regard, however, 
application of well-established statistical techniques for multidimensional 
variable selection encounters much difficulty. This is so because, in one 
aspect, the small number of independent samples and flie presence of outliera 
make the estimates on selected variables unstable for large dimensions. In 
other words, only small sets of genes can be meaningfully considered while a 
relatively large number of genes axe poteatially differentially expressed. It is 
generally impossible to compare all gene subsets and find the qitimal one 
because the numb^ of possible gene combinations is prohibitively large. On 
the other hand, if a global optimum could be found, it might be overiy specific 
to a training sample due to overfitting. Thus, it remains a significant challenge 
to scale methods for identifying differentially expressed genes to deal with 
microarray data of high dimensional space. 

Therefore, there is a need to address the difficulties in applying 
multivariate analysis to microarray data - a need to establish rigorous methods 
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for identificatioTi of differentially expressed genes fhwn high dimensional gene 
expression data. 



SUMMARY OF THE INVENTION 



5 It is therefore an object of this invention to provide multivariate methods 

for analyzing microarray gene expression data of high dimensional space and 
thereby identifying differentially expressed genes. Particularly, it is an object 
of this invention to provide methods for identifying larger sets of differentially 
expressed genes starting from feature spaces of smaller dimensionality where 
10 accurate estimates on covariance matrix can be made. More particularly, the 
present invention provides a random search method with multiple starts and 
early stop. 

In accordance with the present invention, there is provided methods for 
identifying a set of genes from a multiplicity of genes whose expression levels 

15 at a first and a second state, in a first and a second tissue, or in a first and a 
second types of cells are measured in replicates using one or more nucleotide 
arrays, thereby generating a first plurality of independent measurements of the 
expression levels for the first state, tissue, or type of cells and a second 
plurality of independent measurements of the expression levels for the second 

20 state, tissue, or type of cells. The method comprises: (a) identifying a quality 
function enable of evaluating the distinctiveness between the first plurality 
and the second plurality; (b) selecting a subset of genes, whose expression 
levels in tihe first and second states, tissues, or types of ceUs are represented in 
the first plurality and the second plurality, respectively; (c) calculating the 

25 values of the quality function for the subset of genes in the first state and said 
second state based on the first and second plurality, thereby determining the 
distinctiveness of the first and the second plurality; (d) substituting a gene in 
the subset with one outside of the subset, thereby generating a new subset, and 
repeating step (c), keeping the new subset if the distinctiveness increases and 
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the original subset if otherwise; (e) repeating steps (c) and (d) for a first 
predetermined number of times, thereby identifying a locally optima] subset of 
genes; (f) repeating steps (b) to (e) for a second predetermined number of 
times, thereby identifying the second predetermined number of the locally 
5 optimal subsets; and (g) integrating the second predetermined nimiber of the 
locally optimal subsets into the set of genes, wherein the set is larger than the 
locally optimal subsets in size. 

According to the present invention, in certain embodiments, the states 
may be biological states, physiological states, pathological states, and 

10 prognostic states. In other embodiments, the tissues may be normal lung 
tissues, cancer hing tissues^ normal heart tissues, pathological heart tissues, 
normal and abnormal colon tissues, normal and abnormal renal tissues, normal 
and abnormal prostate tissues, and normal and abnormal breast tissues. In yet 
other embodiments, the types of cells may be normal lung cells, cancer lung 

15 cells, normal heart cells, pathological heart cells, normal and abnormal colon 
cells, normal and abnormal renal cells, normal and abnormal prostate cells, and 
normal and abnormal breast cells. In still other embodiments, the types of cells 
may be cultured cells and cells isolated from an oiganism. 

According to another embodiment of this invention, the integrating is 
20 performed by selecting the genes whose £requency of occiurences in the second 
predetermined number of the locally optimal subsets exceeds a third 
predetermined number. In certain embodiments, tiie third predetermined 
number is 1% or S%. According to yet another embodiment, the first 
predetermined number is sufficiently small such that the global maximum is 
25 not reached. According to still another embodiment, the quality function is a 
parametric function or a non-parametric function. In a fiirflier embodiment, the 
parametric function is selected firom the group consisting of the Mahalanobis 
distance and the Bhattachaiya distance. 
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In various embodiments of the invention, the nucleotide arrays may be 
arrays having spotted thereon cDNA sequences and/or arrays having 
synthesized thereon oligonucleotides. 

BRIEF DESCRIPTION OF DRAWINGS 

Fig. 1 depicts the steps of multivariate random search with multiple 
starts and early stop according to one embodiment of the invention. 

Fig. 2 shows the differences of gene selection using multivariate random 
search with early or late stop according to various embodiments of the 
invention. First row are histograms of the values from the "last best iteration" 
in the Ncycic search. Second row are histograms of the estimated Mahalanobis 
distances for the Ncycie selected sets. Third row are histograms of the frequency 
of occurrences of the differentially expressed genes (1-20) in one of the 
selected sets. 

Fig. 3 shows ROC curves for various values of Nfter controlling the 
stopping time based on 10 simulated data sets, eixor bars depicting the 
corresponding standard errors. 

Fig. 4 shows the differences of gene selection from same or different 
tissues using multivariate random search wifli early or late stop according to 
various embodiments of die invention. First row are histograms of the values 
of the 'last best iteration" in the Ncycie searches. Second row are histograms of 
flie estimated Mahalanobis distances for the Ncycie sub-optimal sets. 

Fig. 5 shows flie differences of flie frequency of inclusion in the selected 
locally optimal set using multivariate random search according to one 
embodiment of the invention, applied to same or different tissue samples and 
with or without controls. 
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DETAIL DESCRIPTIONS OF DISCLOSURE 
Definition 

As used herein, the term "microarray" refers to nucleotide arrays; 
"array," "slide," and "chip" are used interchangeably in this disclosure. 
Various kinds of nucleotide arrays are made in research and manufacturing 
facilities worldwide, some of which are available commercially. There are, for 
example, two kinds of arrays depending on flie ways in which the nucleic acid 
materials are spotted onto the array substrate: oligonucleotide arrays and cDNA 
arrays. One of the most widely used oligonucleotide arrays is GeneChip™ 
made by Affymetrix, Inc. The oligonucleotide probes that are 20- or 2S-base 
long are synthesized in silico on the array substrate. These arrays tend to 
achieve high densities (e.g., more than 40,000 genes per cm^). The cDNA 
arrays, on the other hand, tend to have lower densities, but the cDNA probes 
are typically much longer than 20- or 25-mers. A representative of cDNA 
arrays is LifeArray made by Incyte Genomics. Pre-synthesized and amplified 
cDNA sequences are attached to the substrate of these kinds of arrays. 

Microarray data, as used herein, encompasses any data generated using 
various nucleotide arrays, including but not limited to those described above. 
Typically, microarray data includes collections of gene expression levels 
measured using nucleotide arrays on biological samples of different biological 
states and origins. The methods of the present invention may be employed to 
analyze any microarray data; irrespective of the particular microarray platform 
from which the data are generated. 

Gene expression, as used herein, refers to the transcription of DNA 
sequences, which encode certain proteins or regulatory functions, into RNA 
molecules. The expression level of a given gene refers to die amount of RNA 
transcribed therefrom measured on a relevant or absolute quantitative scale. 
The measurement can be, for example, an optic density value of a fluorescent 
or radioactive signal, on a blot or a microarray image. Differential expression. 
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as used herein, means that the expression levels of certain genes are different in 
different states, tissues, or type of cells, according to a predetermined standard. 
Such standard maybe determined based on the context of the expression 
experiments, the biological properties of the genes under study, and/or certain 
statistical significance criteria. 

The terms ^Vector," "probability distance," "distance,*' "the 
Mahalanobis distance," "the Euclidean distance," "feature," "feature space," 
"dimension," "space," "type I error," "type II error," and "ROC curve" are to be 
understood consistently with their typical meanings established in the relevant 
art, i.e. the art of mathematics, statistics, and any area related thereto. For 
example, a set of microarray data on p distinct genes represents a random 
vector X = X], . . Xp with mutually dependent components. 

Improved Random Search Procedure with Multiple Starts and Early Stop 

Random search algorithms have been used for finding optima in 
complex combinatorial spaces. See, e.g., Zhigljavsky AA., Vol. 65, 
Mathematics and its Applications, Kluwer Academic Publishers Group, 
Dordrecht, 1991 . The improved random search procedure according to one 
embodiment of this invention applies a local search procedure multiple times 
and then integrates the selected sets of genes to build a global optimal set of 
differential expressed genes. To prevent overfitting, short local searches may 
be performed. Local maximum regions are carefully examined and 
convergence to a unique global maximum is avoided. The method can be 
applied in conjunction with a variety of parametric and non-parametric quality 
functions, which are discussed in more detail in the next section. In certain 
embodiments, the improved random search procedure with multiple starts and 
early stop includes the following steps: 

1 . Randomly select Nsnbsei genes from Naij, wherein Nsubsei is the number 
of genes in a subset, Naji is the total number of the genes, and Nsubsei is smaller 
then Naij- 
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2. Evaluate flie quality function for the N„b^ genes. 

3. Generate a new evaluation point (i.e., starting point) by swapping one 
or more randomly selected genes between the currently selected set and the rest 
of Ae genes, thereby identifying a new Nsutset- 

4. Evaluate the quality function for the new N^^set genes; if its value has 
decreased, then return to the previous Ns„bset, otherwise maintain the new 

5. Repeat steps 3 and 4 until the number of iterations reaches a 
predetermined number - let it be Niu, - then save the resultant N«,te« at that 



10 6. Repeat steps 1-5 Ncycie times. 

7. Evaluate the resultant N^ycie groups ofN^ genes to identify an 
integrated larger set of genes. 

In step 7, a post-processing step, the local optima are combined to 
provide a final, global solution, i.e., an integrated larger set of differentially 
15 expressed genes. Heuristically, strongly differentially expressed genes should 
appear in many of the local maxima. Therefore, each gene to be included in 
the final set of differentially may be identified based on the fi-equency of its 
occurrence m the sub-optimal (i.e., locally optimal) sets derived fix>m each of 
the Ncycie cycles, as performed in steps 1-6 above. A conservative estimate of 
20 the p-value corresponding to the observed fi-equency can be calculated. For 
example, if a gene is not differentially expressed, the probability that it will be 
in the selected subset by chance is expected to be equal to 'N^oset I Nan, and 
most likely smaller. As the number of repetitions Neyde is lai^e, the final 
selection firequency of this gene may be approximated by a Poisson distribution 
25 withameanNcj^e*N^set/Non. Based on fliis null-distribution the 
. corresponding p-values for each gene may be calculated. 



point 
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Generally, Nsubsctis limited by the number of available training samples 
(e.g., the number of microarray slides in a typical experiment) and hence, Ns„bsci 
may be significantly smaller than Nan- Depending upon the particular quality 
function of choice, the nature and the extent of this limitation may vary; but, 
5 generally, both parametric and non-parametric criteria are sensitive to the 
scarceness of training samples in a high-dimensional feature space. In this 
connection, one significant advantage of the improved random search method 
disclosed herein is that, the detectable number of the differentially expressed 



10 interaction structure (e.g., the covariance matrix) may be affected. In other 
words, a relatively large set of differentially expressed genes may be identified 
by integrating the subsets of genes selected from multiple local searches. In 
some embodiments of this invention, the final set of differentially expressed 
genes is significantly larger in size than the subset identified in the local search, 

15 i.e., the locally optimal subset: Ngubsef 

The determination of Niter is crucial for preventing overfitting. It cannot 
be too small because a small value may not permit finding truly differentially 
expressed genes. On the other hand, too large a number will not be efficient. 
When the value is too big, the same maximum may be attained in many 
20 iterations of search because of overfitting. 

With regard to Ncycio this is a number that substantiates the variability of 
this random search procedure. It may be as large as possible, only limited by 
the applicable CPU power (e.g., N^yde = 1,000,000 may be used). 

Quality Functions Used in Conjunction with tho Random Searcit 
25 Procedure 

A variety of quality functions may be used in conjunction with the 
improved random search procedure in various embodiments of this invention. 
A quality function measures the "distinctiveness" of the two tissues or two 
30 biological states under comparison based on a set of genes, taking into account 



genes is not limited by Njubseb even though the depth of the estimated 
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the correlation structure. Generally, properly specified parametric methods are 
more powerful than non-parametric methods due to the utilization of additional 
information accounted in the model, although such parametric quality functions 
may be sensitive to any departure from the model. With microarray data, since 

5 small sample sizes are a prevalent problem, choosing an appropriate parametric 
quality function may be advantageous in its power, whereas a non-parametric 
random search method may be more robust. One parametric measure of the 
differences between two multidimensional samples is the Mahalanobis 
distance, which is used in one embodiment of tiiis invention. See, Mahalanobis 

10 PC, Proceedings of the National Institute of India (1936) 2 Vol. 49, 



where v and u are the sample means and , Zv are the two sample 
variance-covariance matrices. It is a natural extension of the /-statistic to a 
multidimensional setting. Because of the matrix inverse involved, the 

15 calculation of the Mahalanobis distance at every step of the search - for Ncycie * 
Niter times - may appear to be prohibitive . However, with the improved 
random search procedure of this invention, changes in the vectors are only in 
one dimension on every step (see supra, steps 1-5); therefore, a fast update 
formula may be permitted. See, e.g., McLachlan GJ., Discriminant Analysis 

20 and Statistical Pattern Recognition, (1 992) Wiley, NY. 

In another embodiment of this invention, the Bhattacharya distance may 
be used, especially where differences in both the mean and the covariance 
structure are of interest. 




i_5 a, I — ^ — ' 
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Similarly, other parametric or non-parametric dissimilarity measures 
may be used in various alternative embodiments in conjunction with the 
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improved random search procedure disclosed herein. Such different choices of 
quality functions each may be designed to deal with microarray data with 
different characteristics. 

Further, when using various quality functions, various background 
5 reduction, normalization, and other adjustment procedures may be applied to 
tfie microarray data. For example, rank-based adjustment and the typical 
mean-log adjustment (dividing by mean and take logarithm) may be used. In 
one embodiment, the following adjustment is implemented: the data points on 
each slide or array were replaced by their normal scores using the formula 

where O"^^ (a) is tihe 1 00a* percentile of the standard normal 
distribution and rankj Xi j is the rank of among all of the observations on 
the j* slide. See, Tsodikov A. et al., (2002) Bioinformatics 1 8: 251-260. 



Computer Simulation of the Multivariate Search Method 

15 

A simulation study was performed to evaluate the improved random 
search method Totally 1000 genes were divided into subsets of equal size 20. 
One of flie subsets was selected to be deemed as differratially cTipressed with 
the gene-specific ratio d randomly genefrated for each of the genes from a 

20 lognormal distribution with mean 1 and variance 0.5. The correlation structure 
was kept the same in the two hypothetic tissues. In the selected subset some of 
Hie genes exhibited large over- or under-expression, while others with d » 1 
changed their expression level only slightly. The simulation was performed on 
20 slides or arrays with one of the tissues on the green channel and the other on 

25 the red channel. The relevant parameters for the random search were set: Ncycie 
= 10,000, Nsubsct = 5; and, the Mahalanobis distance was used as the quality 
function. 
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Referring to Fig. 1 , the results of the random search procedure are 
compared between = 1000 (in the left panels) and Nii„ = 1 00,000 (in the 
right panels). The two graphs on the top show the histograms of the values of f 
the "last good iteration" - the number of iterations after which no new 
5 successftil steps were encountered (i.e., when no new subset was found any 
more at stqp 4 of the aforementioned procedure and thus the final set was 
determined)* The two histograms demonstrate that 1000 iterations were a little 
less than sufficient to reach the global maximum, whereas 10,000 iterations 
were more than raough for the random search to converge. 

10 The middle graphs illustrate the same phenomenon in another way. In 

the case of early stopping, i.e., when Njier - 1000, the distribution of the 
Mahalanobis distances corresponding to the Ncycie sub-optimal sets is unimodal 
with high variability. Thus, the procedure has explored many different local 
maxima with a variety of corresponding values of the quality function. On the 

15 other hand, when the number of iterations increase, e.g., when Niter 100,000, 
the distribution of the Mahalanobis distances achieved in the sub optimal sets 
became very discrete. In about half of the cases the search reached the global 
maximum on a imique combination of genes. Therefore, in this situation, 
although the global maximum was found, many local maxima and the 

20 corresponding dijQTerentially e?q>ressed genes from the various subsets were 
missed. When early stop is carried out at the 1000-th iteration, none of 10,000 
cycles found the global maximum, but a variety of genes were selected. 

At the bottom panel of Fig. 1, the frequencies of selection for the 20 
genes in the differentially expressed gene set are plotted. The x-axis represents 

25 the number of the genes: from gene No. 1 to No. 20. With Nfter = 1,000, i.e., 
when the early stop was implemented, 17 from 20 genes pass the selection 
criteria (predetermined to be a frequency of occurrence higher than 0.5%). 
Wifli NiMT = 100,000, i.e., when the early stop was not implemented, only 10 
genes met the 0.S% frequency standard when the global maximum was 

30 attained. 
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Referring to Fig. 2, the ROC curves corresponding to values oflin„ 
ranging from 100 to 10,000 based on 10 independently simulated data sets 
were plotted. Other parameters were held constant, that is, Njubsei = 5, Ncycie = 
10,000. For each search, a list of genes with associated frequencies of 
occurrence in the selected subsets were complied and a final set of 
differentially expressed genes was identified by applying cutoif values ranging 
from 0.1% to 10% in frequency. Based on the null hypotheses of no 
differential expression, for each of these sets, the ratio of type I error (i.e., the 
false positive) was defined as the proportion of non-differentially expressed 
genes that was selected into the final set. And the ratio of the type II error (i.e., 
the false negative) was defined as the proportion of the genes in the 
differentially expressed subsets that were not included in the final set The 
resulting ROC curves are shown in Fig. 2. Also, as a reference, the point 
representing the type I error and the power of the marginal /-test with 5% 
significance level is also plotted (referring to the star in Fig. 2). Comparing the 
ROC curves in Fig. 2, a skilled artisan can note that the value of Niier 
significantly affects the performance of the random search procedure: long 
searches are inferior to searches with early stop. There ought to be, however, a 
limit on how early the search should stop, because very short searches are not 
likely to reach any local maxima. According to Fig. 2, the best performance 
was achieved when Ni,er = 500. 

The invention is further described by the following examples, which arc 
iUustrative of the invention but do not limit the invention in any manner. 

Example 1: a Detailed Illustration of Random Search with Multiple-Starts 
and Early Stop 

Referring to Fig. 3, suppose there are p genes and n and m independent 
samples in the two classes respectively, this procedure finds a group of genes 
diflferentially expressed in these classes using information on the ^-variate 
dependence structure. 
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1 . Repeat the following Niter times. Ni,er is not too large; early stop - 
stop before convergence - is implemented. 

a. Randomly select k genes (genes 2 to gene k in Fig. 3) that 
will serve as the seed of the random search. 

b. Calculate the distance between the two classes based on 
the k initially selected genes. 

c. Randomly select a gene (e.g., gene 2 in Fig. 3) from the 
current gene set (gene 2 to gene k in Fig. 3), remove it from the set and replace 
it with a gene randomly selected from outside of the set (e.g., any of gene JH-1 
to gene p in Fig. 3, let it be gene x). 

d. Calculate the distance between the two classes based on 
the new gene set (gene 3 to gene k, plus gene x). If this distance is larger than 
the previously calculated one, then keep the change, otherwise revert to the 
previous set. 

e. Retain the selected sub optimal set of genes, i.e., the set 
that has the largest distances between the two classes. 

2. Repeat step 1 Ncycic times, obtain Ncycie sets of genes of size 

3. For each gene, calculate the frequency of its occurrence as a 
member of a sub-optimal set. 

1) 4. The final set of genes is defined as the genes that have a 
frequency of occurrence exceeding a preset limit 

Example 2: a Source Code Segment implementing Random Search wiOi 
Multiple Starts and Early Stop - Step 1 and 2 of Example 1 

Program gene 1 
c 

parameter (nall=1000, ncl==10, iiitei=500, m=204=2^t=<2) 
parameter (ishift=3000jacyCLE=1000) 
parameter (genadd-5.tdisp»l .,delyugr=2.) 
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parameter (expniaxr=20.,strang=l.e-15) 
parameter (kcl=5,iap=*l,nex=10) 

parameter {pat=l .5,dpat=0.,iTailty"0^,ncls=20,piirity=0.8S) 

c 

CH ARACTER*50 jmode,qua1it, ranf Jai,stat,5tart^onnal,mixup 
CHARACTER*50 sound,!!! 
DIMENSION AP(L*IAP),DEL(M*1) 

DIMENSION DEN((KCL+2)*L),PST(L)J)FM(L*(KCL+2)*L*iap) 
DIMENSION F(KCL+2),DS{M*L*L*(KCL+2)) 
DI]S4UBNS10N Dl(nc!)J)ETER{L),raiikl{m),rank2(m) 
c 

dimension eir(kcl+2),g((kcl+2)*l),cnt(l) 
c 

Dimension inmii(ncI),b(nal!*m*l),a(naIl*m*l),cl(ncl*m*0»"(in*D 

dimension e(ncl*nc]Xito(l),ind3(niter) 

dimension el {ncl*ncl),e2(ncl*ncl),e3(ncl*ncl)^nex*nex) 

dimension jnibest(ncl),x(m*l),v(nall),m22{m*l),ind2(nall) 

dimension i(nc!*ncl*l),r2(ncl*l),r3(ncl*ncl*l) 

dimension nw(lccl),ff(kcl),dd(lccl)^kcl) 

dimension stud(nal1X tkoln2(naU),tniami(naII) 

dimension iex(ncx) 

c 

character* 10 ndata^ ntime 
data iex/l^,3,4,5,6,7,8,9,!0/ 

data 170.5,0.6,0.7,0.8,0,9,1.0,1.1/ 

data ap/0.5,0.5/ 

data qualit /*mahaloV 

data jmode /*one-leave-out-V 

data mixiq> Pno'/ 

data xanf /'ffileV 

data normaVgaussV 

data Stat /^aramV 

data start fhestcof/ 

c 

sound-redl.txt' 

ill-Ved2.Uf 

c 

OPEN (unit=NT4TIJE=V,F0RM=TORMATTED\STATUS=*uito 
open(unit= 1 1 , fi!e=50imd,fonnF*foiniatted*, 

* status=*old') 

open(unit»22,iiIe«iIl,fonnFvfQnnBtted'» 

* status='*o!d') 

opeD(unit=68,fiIe=*inbestdat'/ormF%niiatted', 

* status- unknown*) 

open(unit==69,file=1)estdat*/armF%nnatted'y 

* status=*unlaiown*) 
c 

wiite{iit,X//30x,"GENE CLUSTER MASTER"/)') 
wite(nt,'(T^umber of slides M - i3)')ni 
write (nt,'("Number of genes NALL =",i5, 

* " sgemiin cluster si2e='',i4," ,to be seaiched:**,i4)') 

* na]],ncls,ncl 

write {nt,'("DATA normalization to(by) = "^AlOj^normal 
write (nt.XType of Statistics Used « "^1 0)')stat 
ifl;ranf.ne/ffi]e')then 

write(nt,*C*Overexprcssion of Poisoned Genes ",f5.1, 

* "Variance ^fS.l)*) 

* genadd,disp 

write(nt,X*'Random Numbers Generator ",alO," jShift^iS)*) 
*** ran^ishilt 
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end if 

write(nt/(//30x, "SIMULATION PARAMETERS"/)*) 
write(nt;("Soiiiid data from: -a30)')soiind 
write{nt;(*TatoIogy data from: "aSO)*)!'!! 
5 write(nt;("SIMULATED PATOLOGY LEVEL: '',f3.1,"+/-",D.l)') 

* pat ,dpat 

write(nt;("Level of mutual Frailty for Cluster. f5.2)0 
♦frailty 

write(nt,("Mixture: ",f5.2."LogNonn +",f5.2"Uiiifonn")') 
10 * purity, 1 .-purity 
c 

write(nt/(//30x,"SEARCH PARAMETERS"/)') 
wrile(nt;("MIXUP the GENES? a]0)')niixup 
write(Bt/{"SEARCH MODE aIO)0quaIit 
1 5 write(nt;("Number of Random Search Trials:"!?)*) 

* niter 

if (nex.ne.O) then 

write(iit;("ATTENSION!, Genes Excluded:7lO(lQi6/))') 
*(iex(i),i=l,nex) 
20 end if 

if(qualiteq/parz*.or.quaIit.eq.TcnnOthen 
witeCnt/CMODE OF BAYES QUALITY ", al0)7ku 
write (nt/C'Number and values of kernels",! 5/ 

* 15 f5.iy)kcWfl;i),W,kcl) 
25 end if 



do i=l,ishifl 
aa=Tndni(-l.) 
end do 

30 c . 

ii(ranf.eq.'uni')then. 

do r=l;Qall*m*2 

b(i)-l.+nidn!(-l.)*disp 

end do 
35 c 

else if (ranf.eq.'nonTico')then 

doi=lpialI*m-l,2 

call noimco(b(i),b(i+l),5.,3.,disp,disp,0.9) 
end do 

40 do i=nall*mf l^all*m*2-l,2 

call nonncoCb(i),b(i+l),5.^.,disp,disp,-0.9) 
end do 

else if{ranf eq/ffile*) then 
call rfroroilb, nall^mj) 
45 c 

else 

write(nt,'("no such data nx>de",alO)')ranf 
stop 67 
end if 

50 

ii(ranf jie.'£Qle*) then 
c 

doj=m^*m-I 
do i=nall*j+l^l*j+ncl 
55 b(i)=b(i>gcnadd 
end do 
end do 

if(nalIleJO)tben 
write (nt,'(10n.2/)')b 
60 end if 
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end if 
c 

ifl[niixup.eq/yes*) then 
do i=I,naJI 
5 ind2(j)=i 
end do 
do i =1, ishifl 
iin=nidni(-l .)*nan+l 
iout=mdm(- 1 .) *nall+ 1 
10 nuraoId=ind2(iout) 
ind2(iout>=ind2(iin) 
ind2 (iin)=nuniold 

ifl[iin.gt.nall.or.iout.gt.nalLor.iin.lt.l.or.iout.U.l)lhcn 
write(*;{"BIGGGGG! ! ! SilSWi^iout 
15 end if 

call exchange(b^a]l,in,l,iinjiout^ii) 
end do 

if (debug.ge.5} then 

write (nt,'("Mixcd Clustei"/! 000(1 (MS/))') 
20 * (ind2(i),i=l,nall) 
end if 
end if 
c 

if (noimal.ne/no*) then 
25 call nonnalization(b,ind2^all^l,stud,ikolni,nonnal) 
end if 
c 

call tests(b,nl22^2,iiall^I^u^tnd,tkohn,tn]ann^t,nc]) 
c 

30 ito(l)=m 
ito(2)=m 
mb^ito(l)+ito(2) 
istg=0 
c 

35 sd=0. 

stiter=l.e20 
c 

doi=l,m*l 
u(i>l./m 
40 end do 

c 

if{start.cq.'bcstcor'.andjiex.ge.2) then 
do i=l,nex 
inum(i)=jex(i) 
45 end do 

call assign(b,imira,cl,nex^all,m,l) 
c 

c write "/10(10f8^/))') 
c*(u(iXi«l,m*I) 
50 c 

can ndsrl(cU2^,u^ex,ito,n]bJ) 
call covci<r,r3,z,nex,ni,l) 

write (nt,'(/25x/'CORRELATION MATRIX710(12i6/))') 
* (iex(i), i=l^x) 
55 write (nt,V10(10f6.2/))') 
*(r3(i),i=I^*nex*ncx) 

write (nt,'(/25x,"FISHER MATRDCVlOClOie/))") 
*(iex(i), i^Jjuex) 
write (nt;(/10(10f6.2/))') 
60 ♦(2(i),i-Mex*ncx) 
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write (nt,'(/"Genes means 75(10f6-2/))') 

*(r2(j),i=l,2*nex) 

c 

call bhafas(r,r2.e,e1 ,e2,e3,rb,nii,rc^ex,qua!it,debug) 
5 write (nt/(rMaha]onobis Distance: ",fl2J2)')rm 
c 

stop 777 
end if 

DOICy=l,NCyCLE 
10 ii=0 

ii][sta]t.eq.YandQn]*) then 
c 

iijF=TOdni(-l .)*naD+ 1 
inuin(l)=iin 
15 c 

do 1=2 ^cl 
88 continue 

inew=nidni(- 1 .)*nall+ 1 
doj=l,i-l 
20 ii(inew.eq.inuni(i-j))then 
go to 88 
else 

inum(i)=inew 
end if 
25 end do 
end do 

else iflstart.eq/lasf) then 

DOI«l,NC3L 

ii=ii+l 

30 inuni(ii)=i+NALL-NCL 
end do 

else ifl[start.eq.*first') then 
do nall-nc],-l 

35 inun)(ii>i+NALl^NCL 
end do 

else if][start.eq/fronibest') then 

rcad(68/i7,el2.4,{10(10i6/))011,qq,(inum(i),i=l,ncl) 

else 

40 stop 9999 
end if 
c 

write (nt/C Initially Selected genes 7 
♦SClOiS/OOinum 
45 DOiteF=l^ter 
c 

if(iter.nB.l)then 

call change(inuin,nall^cMin,iout,niimold,ind34ter,niter, 
*iex,nex) 
50 else 
iin^l 
iout=l 
numold=99 
c 

55 end if 

if(iter.gtl.and.iter.le.5,and.debug.ge.3.) then 
write (nt,'("Iteration",i4,'' Exchanged genes ".3i5/ 

* '"MASK Array"/ 

* 5(10i5/))')iterJIN,iout^unaold,(inuDT(i)^=l,ncl) 
60 endif 
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c 

call assign(b,inuni,cI^c],nalJ,ni,l) 
c 

if][stat.eq.'param')then 
5 call misrl(cI^2;r,u,iicl4to,inb,J) 
c 

if(debug.ge.3)then 
write (nt;("Geiies cov 75(5n2.5/))')r 
call covcr(r,r3,2^cl^l) 
1 0 write (nt,'("Genes cor 75(5fl 2.5/))')r3 
write (nt/C'Genes means ",5(5n 2.2/))7r2 
end if 

else if (stateq.'nonparam')then 

call SPIRl(cl/23,X.V,NCL,MJUni22,ind2^1^nk2) 
] 5 if (debug.ge. 5) then 

write (nt,X"Genes spirmen 75(10fl2.5/))> 

write (nt;{"Genes medians ",5(10n2.20)Or2 

write (nt.X'Genes interQU "^(1002.2/))') 

* (v(i),i=l,ncl*I) 
20 c stop 777 

end if 

end if 

c BHATTACHARYA DISTANCE 
c 

25 if(qualit.eq.*bhata') then 
ss=rb 

else if (qualit.eq.Wlialo*) then 
ss=nn 

else if (qualiteq.'coTcoiO then 

30 ss==rc 
c 

else 

writeCnt/C'no such quality function^alO}^ qualit 
stop 67 
35 end if 
end if 
c 

IF(SS.GT.SD) THEN 
SD=SS 
40 ISTG=ISTG+1 
c 

c REMEMBERING OF BEST VALUES 
c 

c CALL UCOPY(inun^iinbesUicl) 
45 doiu=l,ncl 

in]ibest(iu)=inum(in) 
end do 
ibest=iter 
c 

50 CALLDATIMH(NDATA,NT1MH) 
c 

WRITE(*;("SUCCESS at: ".A12,2X^12. 

♦ •* ITERATION", i7," QUALITY",el4.6)') 

* NDATA»NTTME,ITER,SD 

55 write(*;(10(10i5y))0 (imim(i),i=Mcl) 
if^debug.ge.2)then 

WRITE(nt,X"GOOI>!n,Itcration and Q",i7,el4.6)')ITER,SD 
write(nt;(10(10i5y))') (inum(i),i-l^l) 
end if 

60 ELSE IF(SS.LE.SD) THEN 
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inum(iout)=numold 
if{debug.ge.3.and.iter.le.lO)then 
CALL DATIMH(NDATA,NTIME) 

WRITE(nt/("BAAD!!!,QDew and Qbest",i7;2el4.6)')ITER,SS,SD 
5 end if 
END IF 

if(SD.GT.STlTER) then 

write (NT;("REQUIRED DISTANCE ACHIEVED r",2c 1 5.3)0 
10 ♦SD,SnTER 
goto 18 
end if 

c 

END DO 
IS IScontmue 

writc(nV(25x," CYCLE N " i6/)')ICY 
write(nV('»Distancc used : "^6," Quality=",el53)') 

* qualit,sd 

write(nt/("Niimbcr of successful steps : ",i5)')istg 
20 write(nt,X"Best Ouster Obtained Aften 
*i9/20(10i7/))') 

* mter,imbest 
c rewind 68 

write(69,'{i7,n2.4,20i6)') ibest,sd,imbest 
25 c write(69;(i7,fl2.4)*) ibesl,sd 
END DO 
c 

stop 
end 
30 c 

subroutine lests(b^2Jaid2,nall^,x,u^d,tkohn,tiiiann^t^cl) 

dimension b(na]I,n],])^tud(naIlX *JcoIin(naII),tniajin(nall) 

dimension x(in*l),u(n}*I),m22(in*I),ind2(nal]) 

134=^*0.75 
35 il4«ni*0^5 

i5=«in*0.5+I 

do i=I,nall 

doj=l^ 

x0>b(ij,l) 
40 uO)=b(y^) 

end do 

can sortzv(x,ni22^1,l,0,0) 
xd=(x(m22{i34))-x{ni22{il4)))/135 
xm=x(ni22(i5)) 
45 call soitzv(u,m22, 1 ^ 1 ,0,0) 

ud=(ii(in22(i34))-u(m22(il4)))/l .35 
um=u(ni22(i5)) 

5tud{i)=abs(xm-um)/sqil{xd*xd+iid*ud) 
end do 

50 call sortzv(stud^d2^all, 1 , 1 ,0,0) 

write (nt,X"N.Studcnt Quster" 

*/1000(10i8/))') 

*(ind2(04=Mcl) 

call eirors(ind2^nc],na]l) 
55 c 

do i=^l^l 

xni=0. 

xm2=0. 

um=0. 
60 uni2=0. 
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doj=l^ 

xa)=b{ij,3) 

uO>b(ij.2) 

end do 
5 doj=l,m 

xm=xnH-x(j) 

xin2=xiTi2+x(j)*xO*) 

uin=uni+u(j) 

um2=um2+u(j)*u(i) 
10 end do 

xin=xin/m 

uin=uni/m 

xd=xni2/ni-xni*xm 

ud=uin2/in-um*uni 
15 stud(i)-abs(xnihumysqrt(xd-l-ud) 

end do 

call soitzv(stud^2^Il,l,l,0,0) 

write (nVCTaram Student Qustei*/! 000(1 OiS/))') 

*(md2(i),i=l^cl) 
20 call erroT^ind2;nt,i]£l,i]aU) 

do i»l,nall 

doj=l^ 

xO>b(ijM) 

x(j+m)=b(io;2) 
25 end do 

CALL UTEST(x,u,m^,tinaiin(i).2U JERR) 
end do 

call sortzv(tmaiiii,md2^all, 1,0,0,0) 
write (nt,'("Mami-Whitney Cluster^Vl 000(1 OiS/))') 
30 *(ind2(i).i-l,ncl) 

call eiTors(iiid2,nt,ncl,nall) 

doi=l,nall 

do j=l,m 

xCi)-b(y,i) 

35 u(,>b(y^) 
end do 

CALL ko]in2(x,UAn],tkolm(i)^rob) 
end do 

call s6Ttzv(tkoInMnd2,iiaII,l,l,0,0) 
40 write (nt,'("KoImogorov CIuster71 000(1018/))') 

*(md2(i)4=l,ncl) 

call error5(ind2,iit,nc],nall) 

retum 

end 
45 c 



Example 3: a Source Code Segment Implementing Integration of The 
50 Results from Local Searches to Build a Larger Set of Genes - Steps 3 
and 4 Of Example 1 

Program genecount 
c 

55 parameter (nalNlOOO, ncliist=5, iitrial==10000,ncut=10,iir=22^t=2) 
parameter (nctnie=20,ipat=l 4itiipw=l^tidw==]7»memw=100000) 
parameter (debug=2.) 
c 

dimension a(nclusl*ntrial),c(nall),cut{ncut),geiiprop(ncIust) 
60 dimension sel(naH) 
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dimension tonttipIe(nclusl+3),ind(nalI;nalI),indl (nail) 
character's 0 selgen 
character*8 mode 

data cut/0.000005,0.00001 ,0.00005,0.001 ,0.002,0.003,0.01 ,0.03, 
5 * 0.05,0.08/ 

data cutpair/O.l/ 
data cpair/0.003/ 
data selgen /T)est.datV 
data mode/'siniV 
10 data niter /500000/ 



CHARACTER*! opmo 
CHARACTER*50 hbname 
CHARACTER'S tek(nclust+3) 
15 DATA qpmo/X'/.LMCLR/1024/,LRECLW/1024/ 



OPEN (UNrr=NT,FILE=T>.counf , FORM=TORMATTED',STATU&=tJNKNOWN*) 
open(unit=nr,n]e==selgen/oiin»'foniiatted',status='old*) 



20 hbnanie=*genomeJibook' 

tBk(l)='lastb' 

telc(2Kquality' 

tek(3)='N_of^gen* 

tek(4Kgener 
25 tek(5Kgene2' 

tek(6>=*gene3' 

tek(7)«'gene4' 

tek(8Kgene5' 



30 ctek(i>'geiieV/icliai(i-2) 
c end do 

ii{Dtiipw.gt.O} tfaen 

call HROPEN(nlidw;ani98'Jibnanie.'N*,lreclw.ISTAT) 
end if 

35 call HBOOKN(ntidw,*GENE SELECmON',nchist+3, 

♦ V/ara98*,memw,tek) 

write(nt,VlOx,"GENE SORTER FOR AlO, 

♦ " EARLY STOP AT",I8)>oodc, NITER 
qmean=0. 

40 nmean^O 

ntrj=0 

ncount=l 

doi=l,nclust 

gei3prop(i)=0. 
45 end do 

doj=l,ntrial 

rcad(22,* ern=l 00,end=99)nlast,quality, 

♦ (a(i),i==ncount,nconntHic]iisM) 
tontiiple(l)=nlast 

50 tonti]pIe(2)=quaIity 

kkFO 

do i=4,nclust+3 

55 tontuple(i)=a(nconnt+y) 

if(tontuple(i).le.nctrue) tfaen 

genprop(i-3)«genprop(i-3)+l . 

kk=kk+l 

end if 
60 end do 



c 



c 



c 



c 
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tontupIe(3)=kk 
call HFN(ntidw,tontuple) 
ncouiit=Ticounl+ncIust 
ntrj=Dtrj+l 
5 ninean=iiinean+nlas( 
qmean^qmean+quality 
end do 
go to 99 
1 00 continue 



1 0 write(*,X"ERROR IN HSJPUT STREAM ON LINE: "^^Tf)} 



99 ccmtinue 

1 5 write (iit;{i7," Random Starts, Rm and Last "^2.4,i7)*) 
* ntrj.qineaii/ntrjpuneaii/ntij 



if (mode.eq.'sim*) then 
write {nt;(/"% Of Tnie".10(5fl2.4/))') 
20 * {genprop(j)/ntrj, i=] ^clust) 
end if 

call vzero(c,nall) 

do i=l^clust*ntrj 

do k=l^ll 
25 realk=real(k) 

if(a(i).eqjrealk) then 

sel(k)=sel{k)+l 

c(k>^(k)+l./ntrj 

end if 
30 end do 

end do 

doj=l,ncut 

do i=l,nal] 

indl(i)=0 
35 end do 

ncoiint=0 

do i=l,nan 

if(c(i).gt.cut(j))then 

indl(i>l 
40 if (dcbug.ge.3) then 

write(nt/C'GENE "^5,5x, 

* "Appearance % ";F12i)')I,C(l) 

end if 

ncomit=ncomit+ 1 
45 END IF 
end do 



eirl=0. 
eiT2=0, 
50 do.i=l,nall 

iQindl (i).eq. I .and.i.Ie.nctrue)then 
eirl=errl+l. 

else il^ind 1 (i).eq. 1 .and.i.gtnctme)then 
eir2=err2+L 
55 end if 
end do 

write(nt/(/"N of genes selected with CUT ",F9 J,i8 )') 
* cut(j),ncount 
if (mode.eq/sini'} then 



60 write(nt;ri eiron " J9.5,", 2 eiTor,F9.5)') 



c 



stop 



c 



c 



c 
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* l.-errl/nctnie,err2/na3I 

write{nt;rEta «= 1.- leiTor/sqrt(2eiror): "J?12.5)') 

* erTl/nctrue/sqi1(eiT2/nall) 
end if 

5 end do 

if (debug, ge.4.) then 

ncount=0 

do 3=1 ^all 

doj^l^all 
10 ind(ij)-0 

end do 

end do 

doni=l^trj 

doj="],nclusM 
15 kl'^ifixCaCncount-t-j)) 

doi=j+l,ncIust 

!c2=dfix(a(ncount+0) 

ind(kl,k2)=md(kl^)+l 

end do 
20 end do 

ncount=ncoiint+nclust 

end do 

c 

do i=l^n 
25 do j==i+ 1,11311 
c 

prop=real(ind(ij))/sel(i') 

if(prop.ge.cutpair.and.c(i}.ge.q)air)then 

write(nt,'('Trcq. for geiics:";2i63fl2.5)') 
30 c * "Single frcquencies:",9x;2fl2^)') 

*iJ,piop,c(i).c(j) 

end if 

end do 

end do 
35 end if 

c 

if{ntupw.gt.O) then 

call hrout(o,icycle; •) 

can HRENDCani98') 
40 end if 
STOP 
END 



45 Example 4: MIcroarray Expression Analysis Using Cells from Two 
Colon Cancer Cell Lines 

HT29 cells represent advaaced, highly aggressive colon tumors. They 
contain mutations in both flie APC gene and p53 gene, two tumor suppressor 
50 genes that frequently mutate during colon tumorigenesis. HCTl 1 6 cells 
manifest less aggressive colon tumors and harbor functional p53 and APC. 
They are defective in DNA repair. The experiment was performed with three 
RNA samples (1 jxg RNA each). Cy-3-dCTP (green) was used to label 
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20 



HCTl 1 6 cells while Cy-5-dCTP (red) was used for HT29 cells. Each 
comparison set was hybridized against two microairay slides (facing each 
other) containing 4608 minimally redundant cDNAs spotted in duplicate. As 
control, six Drosophila genes were added to the Cy-5 samples. Thus, in a red 
vs. green comparison they are differentially expressed by design. This 
experiment resulted in a total of twelve measurements on each channel for each 
gene on the microairays. Although a nested dependence structure existed in 
the samples, the analysis assumes them as independent replicates. 
Additionally, ten HCTl 16 samples hybridized with Cy-5 (red) from a separate 
experiment were included in the analysis. 

Two comparisons were performed: (i) HCTl 16 vs. HT29 and (ii) 
HCTl 16 (green) vs. HCTl 16 (red); the first is inter cell lines whereas the 
second is intra cell lines. The relevant parameters for the random search were 
set: Ncycie = 10,000, Hubsei = 5; and, the Mahalanobis distance was used as the 
quality function. 

Referring to Fig. 4, the left panel corresponds to the comparison of the 
different cell lines (as the case (i) above) whereas the right panel to the 
comparison of the same cell line on different channels (as the case (ii) above). 
The histograms of the last best iteration (the top two graphs) are very similar in 
both cases; neither has reached the global maximum. That is, in both cases, the 
procedure kept exploring the local maxima due to the early stopping. 
However, turning to the bottom two graphs, the distribution of the estimated 
Mahalanobis distances at these local maxima in each case is very different from 
each other: When different cell lines were compared, i.e., in the case (i) above, 
the Mahalanobis distances based on the locally optimal subsets tended to be 
much larger than those in the case (ii) above when the same cell lines were 
compared. Therefore, the separation of the two tissues was considerably better 
in case (i) than in case (ii), as one would expect. 
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Referring to Fig. 5, the first 115 genes ordered according to the 
decreasing frequency of occuirence in the selected subsets are plotted. The 
white columns represent genes from same cell line samples without control 
whereas the black colunms represent genes from different cell line samples. In 
addition, the gray columns represent genes from same cell lines samples with 
control. As shown, the right tails of the histogranls are very close to each 
other. Some of the genes in the HCTl 1 6/HT29 comparison (the black 
columns) are selected more often - i.e., have higher frequency - than expected 
under the null hypothesis of no difference between the two tissues (the white 
columns). Interestingly, in the case with same cell line without control (the 
white columns), only two genes had a frequency that was higher than 3%; and, 
when the control genes were included (the gray columns), this number 
increased to six and four out of the top five genes (Nos. 1, 2, 3, and 5 on the x 
axis) were actially Drosophila control genes. 

A frequency level of 1% was selected as the cutoff for identifying 
difrerentially expressed genes. Total 59 genes were selected and thus 59 
cDNA spots were identified on the slides. A comparison was carried out 
between the 59 cDNA spots and the top 59 genes selected by t-statistic. 
Almost half of those genes (25 to be exact) were identified by both methods. 
However, a characteristic advantage of the multivariate random search 
procedure was its ability to identify correlated genes. Some of the genes had 
several corresponding spots on the slides, and therefore their expression levels 
at various spots were known to be correlated. Among tiie 59 genes identified 
by the multivariate random search method, 13 had two, and two had tiboree spots 
inter-related to each other. By comparison, among the genes identified by the 
marginal t-statistic, 17 genes had two or more replicates on the slides, and only 
one of them had all of its replicates selected in the resulting list of genes. 
Therefore, the improved random search procedure of this invention is powerful 
in identifying less pronounced differentially expressed genes when they are 
correlated with more strongly differentially expressed genes. 
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It is to be understood that the description, specific examples and data, 
while indicating exemplary embodiments, are given by way of illustration and 
are not intended to limit the present invention. Various changes and 
modifications within the present invention will become apparent to the skilled 
artisan from the discussion, disclosure and data contained herein, and thus are 
considered part of the invention. 
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CLAIMS 

1 . A method for identifying a set of genes from a multiplicity of 
genes whose expression levels at a first state and a second state are measured ir 
replicates using one or more nucleotide arrays, thereby generating a first 
plurality of independent measurements of the expression levels for said first 
state and a second plurality of independent measurements of the expression 
levels for said second state, which method comprises: 

(a) identifying a quality function capable of evaluating the 
distinctiveness between the first plurality and the second plurality; 

(b) selecting a subset of genes, whose expression levels in said first state 
and second state are represented in said first plurality and said second plurality, 
respectively; 

(c) calculating the values of the quality function for said subset of genes 
in said first state and said second state based on the first and second plurality, 
thereby determining the distinctiveness of the first and the second plurality; 

(d) substituting a gene in said subset with one outside of said subset, 
thereby generating a new subset, and repeating step (c), keeping the new subset 
if the distinctiveness increases and the original subset if otherwise; 

(e) repeating steps (c) and (d) for a first predetermined number of times, 
thereby identifying a locally optimal subset of genes; 

(f) repeating steps (b) to (e) for a second predetermined number of 
times, thereby identifying said second predeteimined number of the locally 
optimal subsets; and 

(g) integrating said second predetermined number of the locally optimal 
subsets into said set of genes, wherein said set is larger than said locally 
optimal subsets in size* 
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2. The method of claim 1 , wherein said states are selected from the 
group consisting of biological states, physiological states, pathological states, 
and prognostic states. 

3. A method for identifying a set of genes fiom a multiplicity of 
genes whose expression levels at a first tissue and a second tissue are measured 
in replicates using one or more nucleotide arrays, thereby generating a first 
plurality of independent measurements of the expression levels for said first 
tissue and a second plurality of independent measurements of the e?q>ression 
levels for said second tissue, which method comprises: 

(a) identifying a quality ftmction capable of evaluating the 
distinctiveness between the first plurality and the second plurality; 

(b) selecting a subset of genes, whose expression levels in said first 
tissue and second tissue are rq^resented in said first plurality and said second 
plurality, respectively; 

(c) calculating the values of the quality function for said subset of genes 
in said first tissue and second tissue based on the first and second plurality, 
thereby determining the distinctiveness of the first and the second plurality; 

(d) substituting a gene in said subset with one outside of said subset, 
thereby generating a new subset, and repeating step (c), keeping the new subset 
if tiie distinctiveness increases and the original subset if otherwise; 

(e) repeating steps (c) and (d) for a first predetermmed number of times, 
thereby identifying a locally optimal subset of genes; 

(f) repeating steps (b) to (e) for a second predetermined number of 
times, thereby identifying said second predetermined number of the locally 
optimal subsets; and 
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(g) integrating said second predeteimined number of the locally optimal 
subsets into said set of genes, wherein said set is larger than said locally 
optimal subsets in size. 

4. The method of claim 3, wherein said tissues are selected from the 
group consisting of normal lung tissues, cancer lung tissues, normal heart 
tissues, pathological heart tissues, normal and abnormal colon tissues, normal 
and abnormal renal tissues, normal and abnormal prostate tissues, and normal 
and abnormal breast tissues. 

5. A method for identifying a set of genes from a multiplicity of 
genes whose expression levels in a first type of cells and a second type of cells 
are measured in replicates using one or more nucleotide arrays, thereby 
generating a first plurality of independent measurements of the expression 
levels for said first type of cells and a second plurality of independent 
measurements of the expression levels for said second type of cells, which 
method comprises: 

(a) identifying a quality function capable of evaluating the 
distinctiveness between tfie first plurality and the second plurality, 

(b) selecting a subset of genes, whose expression levels in said first type 
of cells and said second type of cells are represented in said first plurality and 
said second plurality, respectively; 

(c) calculating the values of the quality function for said subset of genes 
in said first type of cells and said second type of cells based on the first and 
second plurality, thereby determining the distinctiveness of the fust and the 
second plurality; 

(d) substituting a gene in said subset with one outside of said subset, 
thereby generating a new subset, and repeating step (c), keeping the new subset 
if the distinctiveness increases and the original subset if otherwise; 
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(e) repeating steps (c) and (d) for a first predetermined number of times, 
thereby identifying a locally optimal subset of genes; 

(f) repeating steps (b) to (e) for a second predetermined number of 
times, thereby identifying said second predetermined number of the locally 
optimal subsets; and 

(g) integrating said second predetermined number of the locally optimal 
subsets into said set of genes, wherein said set is larger than said locally 
optimal subsets in size. 

6. The method of claim 5, wherein said types of cells are selected 
from the group consisting of normal lung cells, cancer lung cells, normal heart 
cells, pathological heart cells, normal and abnormal colon cells, norma] and 
abnormal renal cells, normal and abnormal prostate cells, and normal and 
abnormal breast cells. 

7. The method of claim 5, wherein said type of cells are selected 
from the group consisting of cultured cells and cells isolated from an organism. 

8. The method of claim 1, 3, or 5, wherein said integrating is 
performed by selecting the genes whose frequency of occurrences in said 
second predetermined number of the final subsets exceeds a third 
predetermined number. 

9. The method of claim 8, wherein said third predetermined number 
is 1% or 5%. 

1 0. The method of claim 1 , 3, or 5, wherein said first predetennined 
number is sufficiently small such that the global maximum is not reached. 

1 1 . The method of claim 1, 3, or 5, wherein said quality function is a 
parametric frinction or a non-parametric function. 
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12. The method of claim 1 1 , wherein said parametric function is 
selected ftom the group consisting of the Mahalanobis distance and the 
Bhattacharya distance. 

13. The method of claim 1, 3, or 5, wherein the nucleotide anaj« 
selected from the group consisting of arrays having spotted thereon cDNA 
sequences and arrays having synthesized thereon oligonucleotides. 
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